In [1]:
%pylab inline
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None
In [4]:
# Create a DataFrame called twitterData from the CSV file.
# Note: if this file is too large for your machine, a smaller data set, twitter1-small.csv, is available in the data folder.
twitterData = pd.read_csv( '../../data/twitter1.csv', encoding='iso8859_15' )
In [5]:
tweetCounts = twitterData['Username'].value_counts()
tweetCounts.head(10)
Out[5]:
In [6]:
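# Copy the selected columns so the new column is added to an independent DataFrame, not a view of twitterData
twitterSummary = twitterData[['Username', 'Friends', 'Followers']].copy()
# Friends-to-followers ratio; users with zero followers produce inf (or NaN when both counts are zero)
twitterSummary['ffratio'] = twitterSummary['Friends'] / twitterSummary['Followers']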
twitterSummary.head()
Out[6]:
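The ratio divides by Followers, which can be zero for some accounts. The sketch below shows one possible way to handle that case, using a small made-up DataFrame (the rows and the `sample` name are assumptions for illustration, not part of twitter1.csv). It replaces infinite ratios with NaN so that later summaries are not skewed by them.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows standing in for the Twitter data
sample = pd.DataFrame({
    'Username': ['alice', 'bob', 'carol'],
    'Friends': [100, 50, 10],
    'Followers': [200, 0, 40],
})

# Dividing by zero followers yields inf in pandas rather than raising an error
ratio = sample['Friends'] / sample['Followers']

# Treat infinite ratios as missing values
sample['ffratio'] = ratio.replace([np.inf, -np.inf], np.nan)
```

Whether to keep, drop, or replace these rows depends on the analysis; NaN is just one reasonable choice because most pandas aggregations skip it by default.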
In the data folder, there is a spreadsheet called studentData.csv containing students and their test scores. Write a script that calculates each student's average test score and adds it as a column to the DataFrame. The first person to raise their hand and tell me which student has the highest average test score, and what that score is, wins something.
In [7]:
studentData = pd.read_csv('../../data/studentData.csv')
# Row-wise mean of the five test columns
studentData['average'] = studentData[['Test1', 'Test2', 'Test3', 'Test4', 'Test5']].mean(axis=1)
# Sort so the student with the highest average appears first
studentData.sort_values('average', ascending=False)
Out[7]:
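Sorting the whole table works, but if only the top student is needed, `idxmax` can pick out that row directly. The following is a minimal sketch on made-up data; the `Name` column, the two test columns, and the scores are assumptions for illustration and may not match the layout of studentData.csv.

```python
import pandas as pd

# Hypothetical students and scores, not taken from studentData.csv
students = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cy'],
    'Test1': [90, 70, 85],
    'Test2': [80, 75, 95],
})

# Row-wise mean across the test columns
test_cols = ['Test1', 'Test2']
students['average'] = students[test_cols].mean(axis=1)

# idxmax returns the index label of the highest average; loc fetches that row
top = students.loc[students['average'].idxmax()]
top_name, top_average = top['Name'], top['average']
```

If several students tie for the highest average, `idxmax` returns only the first one, so the sorted view is still useful for spotting ties.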
Using the Twitter data, find all the users with Facebook accounts and create a new column called FacebookID that contains each user's Facebook ID. A user's Facebook ID can be found in the URL column, for example http://www.facebook.com/profile.php?id=5141860. Extract it using the str.extract function. Don't forget to remove all the invalid or empty IDs.
We've already created a DataFrame for you in the cell above.
In [8]:
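# Keep only rows whose URL mentions Facebook; copy so the new column is added to an independent DataFrame
newData = twitterData[ twitterData['URL'].fillna("").str.contains('facebook') ].copy()
# Raw string avoids invalid escape sequences; the capture group holds the numeric ID
newData['FacebookID'] = newData['URL'].str.extract( r'profile\.php\?id=(\d+)', expand=False)
# Drop only the rows where no ID was found
newData.dropna( subset=['FacebookID'], inplace=True )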
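To see what str.extract does with URLs that match and URLs that do not, here is a small sketch on a hand-made Series. The second URL is invented for illustration; only the first comes from the exercise text. With `expand=False` and a single capture group, the result is a Series, and non-matching or missing entries become NaN.

```python
import pandas as pd

# One URL from the exercise text, one invented non-profile URL, and a missing value
urls = pd.Series([
    'http://www.facebook.com/profile.php?id=5141860',
    'http://www.facebook.com/somepage',
    None,
])

# The capture group (\d+) is what str.extract returns for each row
ids = urls.str.extract(r'profile\.php\?id=(\d+)', expand=False)

# Rows without a match are NaN, so dropna keeps only the valid IDs
valid_ids = ids.dropna()
```

Note that the extracted IDs are strings; convert them with `astype(int)` only if numeric operations are actually needed.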
In [10]:
newData.head()
Out[10]: